Feature engineering for electricity load forecasting#

The purpose of this notebook is to demonstrate how to use skrub and polars to perform feature engineering for electricity load forecasting.

We will build a set of features (and targets) from different data sources:

  • Historical weather data for 10 medium to large urban areas in France;

  • Holidays and standard calendar features for France;

  • Historical electricity load data for the whole of France.

All these data sources cover a time range from March 23, 2021 to May 31, 2025.

Since our maximum forecasting horizon is 24 hours, we consider that the future weather data is known at a chosen prediction time. Similarly, the holidays and calendar features are known at prediction time for any point in the future.

Therefore, exogenous features derived from the weather and calendar data can be used to engineer “future covariates”. Since the load data is our prediction target, we can also use it to engineer “past covariates” such as lagged features and rolling aggregations. The future values of the load data (with respect to the prediction time) are used as targets for the forecasting model.

Environment setup#

When running under jupyterlite, we need to install some extra dependencies for this notebook. We also need the development version of skrub to be able to use the skrub expressions.

%pip install -q https://pypi.anaconda.org/ogrisel/simple/polars/1.24.0/polars-1.24.0-cp39-abi3-emscripten_3_1_58_wasm32.whl
%pip install -q https://pypi.anaconda.org/ogrisel/simple/skrub/0.6.dev0/skrub-0.6.dev0-py3-none-any.whl
%pip install -q altair holidays plotly nbformat
ERROR: polars-1.24.0-cp39-abi3-emscripten_3_1_58_wasm32.whl is not a supported wheel on this platform.

Note: you may need to restart the kernel to use updated packages.

The following 3 imports are only needed to work around some limitations when using polars in a pyodide/jupyterlite notebook.

TODO: remove those workarounds once pyodide 0.28 is released with support for the latest polars version.

import tzdata  # noqa: F401
import pandas as pd
from pyarrow.parquet import read_table

import altair
import numpy as np
import polars as pl
import skrub
from pathlib import Path
import holidays
import warnings

# Ignore warnings from pkg_resources triggered by Python 3.13's multiprocessing.
warnings.filterwarnings("ignore", category=UserWarning, module="pkg_resources")

Shared time range for all historical data sources#

Let’s define an hourly time range from March 23, 2021 to May 31, 2025 that will be used to join the electricity load data and the weather data. The time range is in the UTC timezone to avoid any ambiguity when joining with the weather data, which is also in UTC.

We wrap the resulting polars dataframe in a skrub expression to benefit from the built-in skrub.TableReport display in the notebook. Using the skrub expression system will also be useful for other reasons: all operations in this notebook are chained together in a directed acyclic graph that is automatically tracked by skrub. This allows us to extract the resulting pipeline and apply it to new data later on, exactly like a trained scikit-learn pipeline. The main difference is that we do so incrementally, eagerly executing and inspecting the results of each step as is customary when working with dataframe libraries such as polars and pandas in Jupyter notebooks.

historical_data_start_time = skrub.var(
    "historical_data_start_time", pl.datetime(2021, 3, 23, hour=0, time_zone="UTC")
)
historical_data_end_time = skrub.var(
    "historical_data_end_time", pl.datetime(2025, 5, 31, hour=23, time_zone="UTC")
)
@skrub.deferred
def build_historical_time_range(
    historical_data_start_time,
    historical_data_end_time,
    time_interval="1h",
    time_zone="UTC",
):
    """Define an historical time range shared by all data sources."""
    return pl.DataFrame().with_columns(
        pl.datetime_range(
            start=historical_data_start_time,
            end=historical_data_end_time,
            time_zone=time_zone,
            interval=time_interval,
        ).alias("time"),
    )


time = build_historical_time_range(historical_data_start_time, historical_data_end_time)
time
<Call 'build_historical_time_range'>
Show graph Var 'historical_data_start_time' Call 'build_historical_time_range' Var 'historical_data_end_time'

Result:

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

If you run the above locally with pydot and graphviz installed, you can visualize the expression graph of the time variable by expanding the “Show graph” button.

Let’s now load the data records for the time range defined above.

To avoid network issues when running this notebook, the necessary data files have already been downloaded and saved in the datasets folder. See the README.md file for instructions to download the data manually if you want to re-run this notebook with more recent data.

data_source_folder = skrub.var("data_source_folder", Path("../datasets"))

for data_file in sorted(data_source_folder.skb.eval().iterdir()):
    print(data_file)
../datasets/README.md
../datasets/Total Load - Day Ahead _ Actual_202101010000-202201010000.csv
../datasets/Total Load - Day Ahead _ Actual_202201010000-202301010000.csv
../datasets/Total Load - Day Ahead _ Actual_202301010000-202401010000.csv
../datasets/Total Load - Day Ahead _ Actual_202401010000-202501010000.csv
../datasets/Total Load - Day Ahead _ Actual_202501010000-202601010000.csv
../datasets/weather_bayonne.parquet
../datasets/weather_brest.parquet
../datasets/weather_lille.parquet
../datasets/weather_limoges.parquet
../datasets/weather_lyon.parquet
../datasets/weather_marseille.parquet
../datasets/weather_nantes.parquet
../datasets/weather_paris.parquet
../datasets/weather_strasbourg.parquet
../datasets/weather_toulouse.parquet

We define a list of 10 medium to large urban areas to approximately cover most regions in France, with a slight emphasis on the most populated regions, which are likely to drive electricity demand.

city_names = skrub.var(
    "city_names",
    [
        "paris",
        "lyon",
        "marseille",
        "toulouse",
        "lille",
        "limoges",
        "nantes",
        "strasbourg",
        "brest",
        "bayonne",
    ],
)


@skrub.deferred
def load_weather_data(time, city_names, data_source_folder):
    """Load and horizontal stack historical weather forecast data for each city."""
    all_city_weather = time
    for city_name in city_names:
        all_city_weather = all_city_weather.join(
            pl.from_arrow(
                read_table(f"{data_source_folder}/weather_{city_name}.parquet")
            )
            .with_columns([pl.col("time").dt.cast_time_unit("us")])
            .rename(lambda x: x if x == "time" else "weather_" + x + "_" + city_name),
            on="time",
        )
    return all_city_weather


all_city_weather = load_weather_data(time, city_names, data_source_folder)
all_city_weather
<Call 'load_weather_data'>
Show graph Var 'historical_data_start_time' Call 'build_historical_time_range' Var 'historical_data_end_time' Call 'load_weather_data' Var 'city_names' Var 'data_source_folder'

Result:


Calendar and holidays features#

We leverage the holidays package to enrich the time range with some calendar features such as public holidays in France. We also add some features that are useful for time series forecasting such as the day of the week, the day of the year, and the hour of the day.

Note that the holidays package requires us to extract the date for the French timezone.

Similarly for the calendar features: all the time features are extracted from the time in the French timezone, since it is likely that electricity usage patterns are influenced by inhabitants’ daily routines aligned with the local timezone.

@skrub.deferred
def prepare_french_calendar_data(time):
    fr_time = pl.col("time").dt.convert_time_zone("Europe/Paris")
    fr_year_min = time.select(fr_time.dt.year().min()).item()
    fr_year_max = time.select(fr_time.dt.year().max()).item()
    holidays_fr = holidays.country_holidays(
        "FR", years=range(fr_year_min, fr_year_max + 1)
    )
    return time.with_columns(
        [
            fr_time.dt.hour().alias("cal_hour_of_day"),
            fr_time.dt.weekday().alias("cal_day_of_week"),
            fr_time.dt.ordinal_day().alias("cal_day_of_year"),
            fr_time.dt.year().alias("cal_year"),
            fr_time.dt.date().is_in(holidays_fr.keys()).alias("cal_is_holiday"),
        ],
    )


calendar = prepare_french_calendar_data(time)
calendar
<Call 'prepare_french_calendar_data'>
Show graph Var 'historical_data_start_time' Call 'build_historical_time_range' Var 'historical_data_end_time' Call 'prepare_french_calendar_data'

Result:


Electricity load data#

Finally we load the electricity load data. This data will be used both as a target variable and to craft some lagged and window-aggregated features.

@skrub.deferred
def load_electricity_load_data(time, data_source_folder):
    """Load and aggregate historical load data from the raw CSV files."""
    load_data_files = [
        data_file
        for data_file in sorted(data_source_folder.iterdir())
        if data_file.name.startswith("Total Load - Day Ahead")
        and data_file.name.endswith(".csv")
    ]
    return time.join(
        (
            pl.concat(
                [
                    pl.from_pandas(pd.read_csv(data_file, na_values=["N/A", "-"])).drop(
                        ["Day-ahead Total Load Forecast [MW] - BZN|FR"]
                    )
                    for data_file in load_data_files
                ]
            ).select(
                [
                    pl.col("Time (UTC)")
                    .str.split(by=" - ")
                    .list.first()
                    .str.to_datetime("%d.%m.%Y %H:%M", time_zone="UTC")
                    .alias("time"),
                    pl.col("Actual Total Load [MW] - BZN|FR").alias("load_mw"),
                ]
            )
        ),
        on="time",
    )


electricity = load_electricity_load_data(time, data_source_folder)
electricity
<Call 'load_electricity_load_data'>
Show graph Var 'historical_data_start_time' Call 'build_historical_time_range' Var 'historical_data_end_time' Call 'load_electricity_load_data' Var 'data_source_folder'

Result:


electricity.filter(pl.col("load_mw").is_null())
<CallMethod 'filter'>
Show graph Var 'historical_data_start_time' Call 'build_historical_time_range' Var 'historical_data_end_time' Call 'load_electricity_load_data' Var 'data_source_folder' CallMethod 'filter'

Result:


electricity.filter(
    (pl.col("time") > pl.datetime(2021, 10, 30, hour=10, time_zone="UTC"))
    & (pl.col("time") < pl.datetime(2021, 10, 31, hour=10, time_zone="UTC"))
).skb.eval().plot.line(x="time:T", y="load_mw:Q")
The load data contains a few missing values. Let’s fill them by interpolation so that the lagged and rolling features defined below do not propagate them:

electricity = electricity.with_columns([pl.col("load_mw").interpolate()])
electricity.filter(
    (pl.col("time") > pl.datetime(2021, 10, 30, hour=10, time_zone="UTC"))
    & (pl.col("time") < pl.datetime(2021, 10, 31, hour=10, time_zone="UTC"))
).skb.eval().plot.line(x="time:T", y="load_mw:Q")

Lagged features#

We can now create some lagged features from the electricity load data.

We will create 3 hourly lagged features, 1 daily lagged feature, and 1 weekly lagged feature. We will also create rolling median and inter-quartile range (IQR) features over the last 24 hours and over the last 7 days.

def iqr(col, *, window_size: int):
    """Inter-quartile range (IQR) of a column."""
    return col.rolling_quantile(0.75, window_size=window_size) - col.rolling_quantile(
        0.25, window_size=window_size
    )


electricity_lagged = electricity.with_columns(
    [pl.col("load_mw").shift(i).alias(f"load_mw_lag_{i}h") for i in range(1, 4)]
    + [
        pl.col("load_mw").shift(24).alias("load_mw_lag_1d"),
        pl.col("load_mw").shift(24 * 7).alias("load_mw_lag_1w"),
        pl.col("load_mw")
        .rolling_median(window_size=24)
        .alias("load_mw_rolling_median_24h"),
        pl.col("load_mw")
        .rolling_median(window_size=24 * 7)
        .alias("load_mw_rolling_median_7d"),
        iqr(pl.col("load_mw"), window_size=24).alias("load_mw_iqr_24h"),
        iqr(pl.col("load_mw"), window_size=24 * 7).alias("load_mw_iqr_7d"),
    ],
)
electricity_lagged
<CallMethod 'with_columns'>
Show graph Var 'historical_data_start_time' Call 'build_historical_time_range' Var 'historical_data_end_time' Call 'load_electricity_load_data' Var 'data_source_folder' CallMethod 'with_columns' CallMethod 'with_columns'

Result:


altair.Chart(electricity_lagged.tail(100).skb.eval()).transform_fold(
    [
        "load_mw",
        "load_mw_lag_1h",
        "load_mw_lag_2h",
        "load_mw_lag_3h",
        "load_mw_lag_1d",
        "load_mw_lag_1w",
        "load_mw_rolling_median_24h",
        "load_mw_rolling_median_7d",
        "load_mw_rolling_iqr_24h",
        "load_mw_rolling_iqr_7d",
    ],
    as_=["key", "load_mw"],
).mark_line(tooltip=True).encode(x="time:T", y="load_mw:Q", color="key:N").interactive()

Remark: lagged feature engineering and system lag#

When working with historical data, we often have access to all the past measurements in the dataset. However, when we want to use the lagged features in a forecasting model, we need to be careful about the length of the system lag: the delay between the time a timestamped measurement is made in the real world and the time the record is made available to the downstream application (in our case, a deployed predictive pipeline).

System lag is rarely explicitly represented in the data sources, even though such delays can be as large as several hours or even days and can sometimes be irregular. For instance, if there is a human intervention in the data recording process, holidays and weekends can occasionally add significant delay.

If the system lag is larger than the smallest feature engineering lag, some of the resulting features will be filled with missing values once deployed. More importantly, if the system lag is not handled explicitly, those missing values will be present in the features computed for the deployed system but absent from the features computed to train and backtest the system before deployment.

This structural discrepancy can severely degrade the performance of the deployed model compared to the performance estimated from backtesting on the historical data.

We will set this problem aside for now but discuss it again in a later section of this tutorial.

Investigating outliers in the lagged features#

Let’s use the skrub.TableReport tool to look at the plots of the marginal distribution of the lagged features.

from skrub import TableReport

TableReport(electricity_lagged.skb.eval())
Processing column   1 / 11
Processing column   2 / 11
Processing column   3 / 11
Processing column   4 / 11
Processing column   5 / 11
Processing column   6 / 11
Processing column   7 / 11
Processing column   8 / 11
Processing column   9 / 11
Processing column  10 / 11
Processing column  11 / 11


Let’s extract the dates where the inter-quartile range of the load is greater than 15,000 MW.

electricity_lagged.filter(pl.col("load_mw_iqr_7d") > 15_000)[
    "time"
].dt.date().unique().sort().to_list().skb.eval()
[datetime.date(2021, 12, 26),
 datetime.date(2021, 12, 27),
 datetime.date(2021, 12, 28),
 datetime.date(2022, 1, 7),
 datetime.date(2022, 1, 8),
 datetime.date(2023, 1, 19),
 datetime.date(2023, 1, 20),
 datetime.date(2023, 1, 21),
 datetime.date(2024, 1, 10),
 datetime.date(2024, 1, 11),
 datetime.date(2024, 1, 12),
 datetime.date(2024, 1, 13)]

We observe 4 date ranges with a high inter-quartile range. Let’s plot the electricity load and the lagged features for the first date range, along with the weather data for Paris.

altair.Chart(
    electricity_lagged.filter(
        (pl.col("time") > pl.datetime(2021, 12, 1, time_zone="UTC"))
        & (pl.col("time") < pl.datetime(2021, 12, 31, time_zone="UTC"))
    ).skb.eval()
).transform_fold(
    [
        "load_mw",
        "load_mw_iqr_7d",
    ],
).mark_line(
    tooltip=True
).encode(
    x="time:T", y="value:Q", color="key:N"
).interactive()
altair.Chart(
    all_city_weather.filter(
        (pl.col("time") > pl.datetime(2021, 12, 1, time_zone="UTC"))
        & (pl.col("time") < pl.datetime(2021, 12, 31, time_zone="UTC"))
    ).skb.eval()
).transform_fold(
    [f"weather_temperature_2m_{city_name}" for city_name in city_names.skb.eval()],
).mark_line(
    tooltip=True
).encode(
    x="time:T", y="value:Q", color="key:N"
).interactive()

Based on the plots above, we can see that the electricity load was high just before the Christmas holidays due to low temperatures. The load then suddenly dropped because temperatures rose right at the start of the end-of-year holidays.

So those outliers do not seem to be caused by a data quality issue but rather by a real change in electricity demand. We could conduct a similar analysis for the other date ranges with a high inter-quartile range, but we will skip that for now.

Had we observed significant data quality issues over extended periods of time, they could have been addressed by removing the corresponding rows from the dataset. However, this would make the lagged and windowing feature engineering challenging to reimplement correctly. A better approach is to keep a contiguous dataset and assign 0 weights to the affected rows when fitting or evaluating the trained models, via the sample_weight parameter.

Final dataset#

We now assemble the dataset that will be used to train and evaluate the forecasting models via backtesting.

prediction_start_time = skrub.var(
    "prediction_start_time", historical_data_start_time.skb.eval() + pl.duration(days=7)
)
prediction_end_time = skrub.var(
    "prediction_end_time", historical_data_end_time.skb.eval() - pl.duration(hours=24)
)


@skrub.deferred
def define_prediction_time_range(prediction_start_time, prediction_end_time):
    return pl.DataFrame().with_columns(
        pl.datetime_range(
            start=prediction_start_time,
            end=prediction_end_time,
            time_zone="UTC",
            interval="1h",
        ).alias("prediction_time"),
    )


prediction_time = define_prediction_time_range(
    prediction_start_time, prediction_end_time
)
prediction_time
<Call 'define_prediction_time_range'>
Show graph Var 'prediction_start_time' Call 'define_prediction_time_range' Var 'prediction_end_time'

Result:


@skrub.deferred
def build_features(
    prediction_time,
    electricity_lagged,
    all_city_weather,
    calendar,
    future_feature_horizons=[1, 24],
):

    return (
        prediction_time.join(
            electricity_lagged, left_on="prediction_time", right_on="time"
        )
        .join(
            all_city_weather.select(
                [pl.col("time")]
                + [
                    pl.col(c).shift(-h).alias(c + f"_future_{h}h")
                    for c in all_city_weather.columns
                    if c != "time"
                    for h in future_feature_horizons
                ]
            ),
            left_on="prediction_time",
            right_on="time",
        )
        .join(
            calendar.select(
                [pl.col("time")]
                + [
                    pl.col(c).shift(-h).alias(c + f"_future_{h}h")
                    for c in calendar.columns
                    if c != "time"
                    for h in future_feature_horizons
                ]
            ),
            left_on="prediction_time",
            right_on="time",
        )
    ).drop("prediction_time")


features = build_features(
    prediction_time=prediction_time,
    electricity_lagged=electricity_lagged,
    all_city_weather=all_city_weather,
    calendar=calendar,
).skb.mark_as_X()

features
<Call 'build_features'>
Show graph Var 'prediction_start_time' Call 'define_prediction_time_range' Var 'prediction_end_time' X: Call 'build_features' Var 'historical_data_start_time' Call 'build_historical_time_range' Var 'historical_data_end_time' Call 'load_electricity_load_data' Call 'load_weather_data' Call 'prepare_french_calendar_data' Var 'data_source_folder' CallMethod 'with_columns' CallMethod 'with_columns' Var 'city_names'

Result:


Let’s build training and evaluation targets for all possible horizons from 1 to 24 hours.

horizons = range(1, 25)
target_column_name_pattern = "load_mw_horizon_{horizon}h"


@skrub.deferred
def build_targets(prediction_time, electricity, horizons):
    return prediction_time.join(
        electricity.with_columns(
            [
                pl.col("load_mw")
                .shift(-h)
                .alias(target_column_name_pattern.format(horizon=h))
                for h in horizons
            ]
        ),
        left_on="prediction_time",
        right_on="time",
    )


targets = build_targets(prediction_time, electricity, horizons)
targets
<Call 'build_targets'>
Show graph Var 'prediction_start_time' Call 'define_prediction_time_range' Var 'prediction_end_time' Call 'build_targets' Var 'historical_data_start_time' Call 'build_historical_time_range' Var 'historical_data_end_time' Call 'load_electricity_load_data' Var 'data_source_folder' CallMethod 'with_columns'

Result:


For now, let’s focus on the last horizon (24 hours) to train a model predicting the electricity load 24 hours ahead.

horizon_of_interest = horizons[-1]  # Focus on the 24-hour horizon
target_column_name = target_column_name_pattern.format(horizon=horizon_of_interest)
predicted_target_column_name = "predicted_" + target_column_name
target = targets[target_column_name].skb.mark_as_y()
target
<GetItem 'load_mw_horizon_24h'>
Show graph Var 'prediction_start_time' Call 'define_prediction_time_range' Var 'prediction_end_time' Call 'build_targets' Var 'historical_data_start_time' Call 'build_historical_time_range' Var 'historical_data_end_time' Call 'load_electricity_load_data' Var 'data_source_folder' CallMethod 'with_columns' y: GetItem 'load_mw_horizon_24h'

Result:


Let’s define our first single-output prediction pipeline. This pipeline chains our previous feature engineering steps with a skrub.DropCols step to drop some columns that we do not want to use as features, and a HistGradientBoostingRegressor model from scikit-learn.

The skrub.choose_from, skrub.choose_float, and skrub.choose_int functions are used to define hyperparameters that can be tuned via cross-validated randomized search.

from sklearn.ensemble import HistGradientBoostingRegressor
import skrub.selectors as s


features_with_dropped_cols = features.skb.apply(
    skrub.DropCols(
        cols=skrub.choose_from(
            {
                "none": s.glob(""),  # No column has an empty name.
                "load": s.glob("load_*"),
                "rolling_load": s.glob("load_mw_rolling_*"),
                "weather": s.glob("weather_*"),
                "temperature": s.glob("weather_temperature_*"),
                "moisture": s.glob("weather_moisture_*"),
                "cloud_cover": s.glob("weather_cloud_cover_*"),
                "calendar": s.glob("cal_*"),
                "holiday": s.glob("cal_is_holiday*"),
                "future_1h": s.glob("*_future_1h"),
                "future_24h": s.glob("*_future_24h"),
                "non_paris_weather": s.glob("weather_*") & ~s.glob("weather_*_paris_*"),
            },
            name="dropped_cols",
        )
    )
)

hgbr_predictions = features_with_dropped_cols.skb.apply(
    HistGradientBoostingRegressor(
        random_state=0,
        loss=skrub.choose_from(["squared_error", "poisson", "gamma"], name="loss"),
        learning_rate=skrub.choose_float(
            0.01, 1, default=0.1, log=True, name="learning_rate"
        ),
        max_leaf_nodes=skrub.choose_int(
            3, 300, default=30, log=True, name="max_leaf_nodes"
        ),
    ),
    y=target,
)
hgbr_predictions
<Apply HistGradientBoostingRegressor>
Show graph Var 'prediction_start_time' Call 'define_prediction_time_range' Var 'prediction_end_time' X: Call 'build_features' Call 'build_targets' Var 'historical_data_start_time' Call 'build_historical_time_range' Var 'historical_data_end_time' Call 'load_electricity_load_data' Call 'load_weather_data' Call 'prepare_french_calendar_data' Var 'data_source_folder' CallMethod 'with_columns' CallMethod 'with_columns' Var 'city_names' Apply DropCols Apply HistGradientBoostingRegressor y: GetItem 'load_mw_horizon_24h'

Result:


The hgbr_predictions expression captures the whole expression graph, which includes the feature engineering steps, the target variable, and the model training step.

In particular, the input data keys for the full pipeline can be inspected as follows:

hgbr_predictions.skb.get_data().keys()
dict_keys(['prediction_start_time', 'prediction_end_time', 'historical_data_start_time', 'historical_data_end_time', 'data_source_folder', 'city_names'])

Furthermore, the hyper-parameters of the full pipeline can be retrieved as follows:

hgbr_pipeline = hgbr_predictions.skb.get_pipeline()
hgbr_pipeline.describe_params()
{'dropped_cols': 'none',
 'learning_rate': 0.1,
 'loss': 'squared_error',
 'max_leaf_nodes': 30}

When running this notebook locally, you can also interactively inspect all the steps of the DAG using the following (once uncommented):

# hgbr_predictions.skb.full_report()

Since we passed input values to all the upstream skrub variables, skrub automatically evaluates the whole expression graph (train and predict on the same data) so that we can interactively check that everything will work as expected.

Let’s use altair to visualize the predictions against the target values for the last 7 days of the prediction time range used to train the model. This allows us to check that the model can (over)fit the data with the features at hand.

altair.Chart(
    pl.concat(
        [
            targets.skb.eval(),
            hgbr_predictions.rename(
                {target_column_name: predicted_target_column_name}
            ).skb.eval(),
        ],
        how="horizontal",
    ).tail(24 * 7)
).transform_fold(
    [target_column_name, predicted_target_column_name],
).mark_line(
    tooltip=True
).encode(
    x="prediction_time:T", y="value:Q", color="key:N"
).interactive()

Assessing the model performance via cross-validation#

Being able to fit the training data is not enough. We need to assess the ability of the training pipeline to learn a predictive model that can generalize to unseen data.

Furthermore, we want to assess the uncertainty of this estimate of the generalization performance via time-based cross-validation, also known as backtesting.

from sklearn.model_selection import TimeSeriesSplit


max_train_size = 2 * 52 * 24 * 7  # max ~2 years of training data
test_size = 24 * 7 * 24  # 24 weeks of test data
gap = 7 * 24  # 1 week gap between train and test sets
ts_cv_5 = TimeSeriesSplit(
    n_splits=5, max_train_size=max_train_size, test_size=test_size, gap=gap
)

for fold_idx, (train_idx, test_idx) in enumerate(
    ts_cv_5.split(prediction_time.skb.eval())
):
    print(f"CV iteration #{fold_idx}")
    train_datetimes = prediction_time.skb.eval()[train_idx]
    test_datetimes = prediction_time.skb.eval()[test_idx]
    print(
        f"Train: {train_datetimes.shape[0]} rows, "
        f"Test: {test_datetimes.shape[0]} rows"
    )
    print(f"Train time range: {train_datetimes[0, 0]} to " f"{train_datetimes[-1, 0]} ")
    print(f"Test time range: {test_datetimes[0, 0]} to " f"{test_datetimes[-1, 0]} ")
    print()
CV iteration #0
Train: 16224 rows, Test: 4032 rows
Train time range: 2021-03-30 00:00:00+00:00 to 2023-02-03 23:00:00+00:00 
Test time range: 2023-02-11 00:00:00+00:00 to 2023-07-28 23:00:00+00:00 

CV iteration #1
Train: 17472 rows, Test: 4032 rows
Train time range: 2021-07-24 00:00:00+00:00 to 2023-07-21 23:00:00+00:00 
Test time range: 2023-07-29 00:00:00+00:00 to 2024-01-12 23:00:00+00:00 

CV iteration #2
Train: 17472 rows, Test: 4032 rows
Train time range: 2022-01-08 00:00:00+00:00 to 2024-01-05 23:00:00+00:00 
Test time range: 2024-01-13 00:00:00+00:00 to 2024-06-28 23:00:00+00:00 

CV iteration #3
Train: 17472 rows, Test: 4032 rows
Train time range: 2022-06-25 00:00:00+00:00 to 2024-06-21 23:00:00+00:00 
Test time range: 2024-06-29 00:00:00+00:00 to 2024-12-13 23:00:00+00:00 

CV iteration #4
Train: 17472 rows, Test: 4032 rows
Train time range: 2022-12-10 00:00:00+00:00 to 2024-12-06 23:00:00+00:00 
Test time range: 2024-12-14 00:00:00+00:00 to 2025-05-30 23:00:00+00:00 
from sklearn.metrics import make_scorer, mean_absolute_percentage_error, get_scorer
from sklearn.metrics import d2_tweedie_score


hgbr_cv_results = hgbr_predictions.skb.cross_validate(
    cv=ts_cv_5,
    scoring={
        "mape": make_scorer(mean_absolute_percentage_error),
        "r2": get_scorer("r2"),
        "d2_poisson": make_scorer(d2_tweedie_score, power=1.0),
        "d2_gamma": make_scorer(d2_tweedie_score, power=2.0),
    },
    return_train_score=True,
    return_pipeline=True,
    verbose=1,
    n_jobs=-1,
)
hgbr_cv_results.round(3)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    8.4s finished
fit_time score_time test_mape train_mape test_r2 train_r2 test_d2_poisson train_d2_poisson test_d2_gamma train_d2_gamma pipeline
0 2.957 0.061 0.027 0.012 0.963 0.994 0.962 0.994 0.961 0.994 SkrubPipeline(expr=<Apply HistGradientBoosting...
1 3.244 0.059 0.024 0.013 0.978 0.994 0.977 0.994 0.976 0.993 SkrubPipeline(expr=<Apply HistGradientBoosting...
2 3.204 0.060 0.023 0.014 0.974 0.993 0.974 0.993 0.975 0.992 SkrubPipeline(expr=<Apply HistGradientBoosting...
3 3.209 0.059 0.019 0.014 0.980 0.993 0.980 0.992 0.980 0.992 SkrubPipeline(expr=<Apply HistGradientBoosting...
4 2.145 0.037 0.023 0.014 0.977 0.993 0.978 0.992 0.978 0.992 SkrubPipeline(expr=<Apply HistGradientBoosting...
def collect_cv_predictions(pipelines, cv_splitter, predictions, prediction_time):
    # A single shared generator: each call to `splitter` below consumes one
    # (train, test) pair, so successive `train_test_split` calls walk through
    # the CV folds in the same order as the fitted pipelines.
    index_generator = cv_splitter.split(prediction_time.skb.eval())

    def splitter(X, y, index_generator):
        """Workaround to turn a scikit-learn splitter into a function
        understood by `skrub.train_test_split`."""
        train_idx, test_idx = next(index_generator)
        return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

    results = []
    for (_, test_idx), pipeline in zip(
        cv_splitter.split(prediction_time.skb.eval()), pipelines
    ):
        split = predictions.skb.train_test_split(
            predictions.skb.get_data(),
            splitter=splitter,
            index_generator=index_generator,
        )
        results.append(
            pl.DataFrame(
                {
                    "prediction_time": prediction_time.skb.eval()[test_idx],
                    "load_mw": split["y_test"],
                    "predicted_load_mw": pipeline.predict(split["test"]),
                }
            )
        )
    return results
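The generator trick used in `collect_cv_predictions` can be illustrated in isolation on hypothetical fold indices: each call to `splitter` consumes one (train, test) pair from the shared generator, so repeated calls walk through the folds in order.

```python
import numpy as np

def splitter(X, y, index_generator):
    """Consume the next (train, test) index pair from a shared generator."""
    train_idx, test_idx = next(index_generator)
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Two hypothetical expanding-window folds over 4 samples.
folds = iter(
    [
        (np.array([0, 1]), np.array([2])),
        (np.array([0, 1, 2]), np.array([3])),
    ]
)
X = np.arange(4) * 10.0
y = np.arange(4)
print(splitter(X, y, folds))  # fold 0
print(splitter(X, y, folds))  # fold 1
```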
hgbr_cv_predictions = collect_cv_predictions(
    hgbr_cv_results["pipeline"], ts_cv_5, hgbr_predictions, prediction_time
)
hgbr_cv_predictions[0]
shape: (4_032, 3)
┌─────────────────────────┬─────────┬───────────────────┐
│ prediction_time         ┆ load_mw ┆ predicted_load_mw │
│ ---                     ┆ ---     ┆ ---               │
│ datetime[μs, UTC]       ┆ f64     ┆ f64               │
╞═════════════════════════╪═════════╪═══════════════════╡
│ 2023-02-11 00:00:00 UTC ┆ 59258.0 ┆ 59855.334418      │
│ 2023-02-11 01:00:00 UTC ┆ 58654.0 ┆ 59958.654564      │
│ 2023-02-11 02:00:00 UTC ┆ 56155.0 ┆ 57666.184522      │
│ 2023-02-11 03:00:00 UTC ┆ 54463.0 ┆ 55832.880673      │
│ 2023-02-11 04:00:00 UTC ┆ 54698.0 ┆ 57121.984097      │
│ …                       ┆ …       ┆ …                 │
│ 2023-07-28 19:00:00 UTC ┆ 38781.0 ┆ 40093.987086      │
│ 2023-07-28 20:00:00 UTC ┆ 38455.0 ┆ 39343.771368      │
│ 2023-07-28 21:00:00 UTC ┆ 39972.0 ┆ 40738.151594      │
│ 2023-07-28 22:00:00 UTC ┆ 39825.0 ┆ 39449.468131      │
│ 2023-07-28 23:00:00 UTC ┆ 36822.0 ┆ 35828.293662      │
└─────────────────────────┴─────────┴───────────────────┘
def lorenz_curve(observed_value, predicted_value, n_samples=1_000):
    """Compute the Lorenz curve for given observed and predicted values."""

    def gini_index(cum_proportion_population, cum_proportion_y_true):
        from sklearn.metrics import auc

        return 1 - 2 * auc(cum_proportion_population, cum_proportion_y_true)

    observed_value = np.asarray(observed_value)
    predicted_value = np.asarray(predicted_value)

    sort_idx = np.argsort(predicted_value)
    observed_value_sorted = observed_value[sort_idx]

    original_n_samples = observed_value_sorted.shape[0]
    cum_proportion_population = np.cumsum(np.ones(original_n_samples))
    cum_proportion_population /= cum_proportion_population[-1]

    cum_proportion_y_true = np.cumsum(observed_value_sorted)
    cum_proportion_y_true /= cum_proportion_y_true[-1]

    gini_model = gini_index(cum_proportion_population, cum_proportion_y_true)

    cum_proportion_population_interpolated = np.linspace(0, 1, n_samples)
    cum_proportion_y_true_interpolated = np.interp(
        cum_proportion_population_interpolated,
        cum_proportion_population,
        cum_proportion_y_true,
    )

    return pl.DataFrame(
        {
            "cum_population": cum_proportion_population_interpolated,
            "cum_observed": cum_proportion_y_true_interpolated,
        }
    ).with_columns(
        pl.lit(gini_model).alias("gini_index"),
    )


def plot_lorenz_curve(cv_predictions, n_samples=1_000):
    """Plot per-fold Lorenz curves (model and oracle) for CV predictions."""

    results = []
    for fold_idx, predictions in enumerate(cv_predictions):
        results.append(
            lorenz_curve(
                observed_value=predictions["load_mw"],
                predicted_value=predictions["predicted_load_mw"],
                n_samples=n_samples,
            ).with_columns(
                pl.lit(fold_idx).alias("fold_idx"),
                pl.lit("Model").alias("model"),
            )
        )

        results.append(
            lorenz_curve(
                observed_value=predictions["load_mw"],
                predicted_value=predictions["load_mw"],
                n_samples=n_samples,
            ).with_columns(
                pl.lit(fold_idx).alias("fold_idx"),
                pl.lit("Oracle").alias("model"),
            )
        )

    results = pl.concat(results)

    gini_stats = results.group_by("model").agg(
        [
            pl.col("gini_index")
            .mean()
            .map_elements(lambda x: f"{x:.4f}", return_dtype=pl.String)
            .alias("gini_mean"),
            pl.col("gini_index")
            .std()
            .map_elements(lambda x: f"{x:.4f}", return_dtype=pl.String)
            .alias("gini_std_dev"),
        ]
    )

    results = results.join(gini_stats, on="model").with_columns(
        pl.format("{} (Gini: {} +/- {})", "model", "gini_mean", "gini_std_dev").alias(
            "model_label"
        )
    )

    model_chart = (
        altair.Chart(results)
        .mark_line(strokeDash=[4, 2, 4, 2], opacity=0.8, tooltip=True)
        .encode(
            x=altair.X(
                "cum_population:Q",
                title="Fraction of observations sorted by predicted label",
            ),
            y=altair.Y("cum_observed:Q", title="Cumulative observed load proportion"),
            color=altair.Color(
                "model_label:N", legend=altair.Legend(title="Models"), sort=None
            ),
            detail="fold_idx:N",
        )
    )

    diagonal_chart = (
        altair.Chart(
            pl.DataFrame(
                {
                    "cum_population": [0, 1],
                    "cum_observed": [0, 1],
                    "model_label": "Non-informative model (Gini = 0.0)",
                }
            )
        )
        .mark_line(strokeDash=[4, 4], opacity=0.5, tooltip=True)
        .encode(
            x=altair.X(
                "cum_population:Q",
                title="Fraction of observations sorted by predicted label",
            ),
            y=altair.Y("cum_observed:Q", title="Cumulative observed load proportion"),
            color=altair.Color(
                "model_label:N", legend=altair.Legend(title="Models"), sort=None
            ),
        )
    )

    return model_chart + diagonal_chart


plot_lorenz_curve(hgbr_cv_predictions, n_samples=500).interactive()
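The Gini index underlying these curves can be sanity-checked on synthetic loads (a sketch with illustrative values): ranking observations by their true values gives the maximal "oracle" Gini, while a random ranking gives a Gini close to zero.

```python
import numpy as np
from sklearn.metrics import auc

def toy_gini(observed, predicted):
    """Gini index of the Lorenz curve obtained by sorting by `predicted`."""
    order = np.argsort(predicted)
    cum_obs = np.cumsum(np.asarray(observed)[order])
    cum_obs = cum_obs / cum_obs[-1]
    cum_pop = np.arange(1, len(cum_obs) + 1) / len(cum_obs)
    # Gini = 1 - 2 * area under the Lorenz curve.
    return 1 - 2 * auc(cum_pop, cum_obs)

rng = np.random.default_rng(0)
y = rng.uniform(30_000, 70_000, size=2_000)  # synthetic hourly loads in MW
print(toy_gini(y, y))                   # oracle ranking: maximal Gini
print(toy_gini(y, rng.permutation(y)))  # uninformative ranking: near zero
```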
def plot_reliability_diagram(
    cv_predictions, kind="mean", quantile_level=0.5, n_bins=10
):
    # Global min and max load over predictions and observations across folds:
    all_loads = pl.concat(
        [
            cv_prediction.select(["load_mw", "predicted_load_mw"])
            for cv_prediction in cv_predictions
        ]
    )
    min_load = min(all_loads["load_mw"].min(), all_loads["predicted_load_mw"].min())
    max_load = max(all_loads["load_mw"].max(), all_loads["predicted_load_mw"].max())
    scale = altair.Scale(domain=[min_load, max_load])
    scale = altair.Scale(domain=[min_load, max_load])

    # Create the perfect line
    chart = (
        altair.Chart(
            pl.DataFrame(
                {
                    "mean_predicted_load_mw": [min_load, max_load],
                    "mean_load_mw": [min_load, max_load],
                    "label": ["Perfect"] * 2,
                }
            )
        )
        .mark_line(tooltip=True, opacity=0.8, strokeDash=[5, 5])
        .encode(
            x=altair.X("mean_predicted_load_mw:Q", scale=scale),
            y=altair.Y("mean_load_mw:Q", scale=scale),
            color=altair.Color(
                "label:N",
                scale=altair.Scale(range=["black"]),
                legend=altair.Legend(title="Legend"),
            ),
        )
    )

    # Add lines for each CV fold with date labels
    for fold_idx, cv_predictions_i in enumerate(cv_predictions):
        # Get date range for this CV fold
        min_date = cv_predictions_i["prediction_time"].min().strftime("%Y-%m-%d")
        max_date = cv_predictions_i["prediction_time"].max().strftime("%Y-%m-%d")
        fold_label = f"#{fold_idx} - {min_date} to {max_date}"

        if kind == "mean":
            y_name = "mean_load_mw"
            agg_expr = pl.col("load_mw").mean()
        elif kind == "quantile":
            y_name = "quantile_of_load_mw"
            agg_expr = pl.col("load_mw").quantile(quantile_level)
        else:
            raise ValueError(f"Unknown kind: {kind}. Use 'mean' or 'quantile'.")

        mean_per_bins = (
            cv_predictions_i.group_by(
                pl.col("predicted_load_mw").qcut(np.linspace(0, 1, n_bins))
            )
            .agg(
                [
                    agg_expr.alias(y_name),
                    pl.col("predicted_load_mw").mean().alias("mean_predicted_load_mw"),
                ]
            )
            .sort("predicted_load_mw")
            .with_columns(pl.lit(fold_label).alias("fold_label"))
        )

        chart += (
            altair.Chart(mean_per_bins)
            .mark_line(tooltip=True, point=True, opacity=0.8)
            .encode(
                x=altair.X("mean_predicted_load_mw:Q", scale=scale),
                y=altair.Y(f"{y_name}:Q", scale=scale),
                color=altair.Color(
                    "fold_label:N",
                    legend=altair.Legend(title=None),
                ),
                detail=altair.Detail("fold_label:N"),
            )
        )
    return chart.resolve_scale(color="independent")


plot_reliability_diagram(hgbr_cv_predictions).interactive().properties(
    title="Reliability diagram from cross-validation predictions"
)
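The numeric core of a reliability diagram can be sketched on synthetic data (names and values here are illustrative): bin the observations by predicted value and compare the per-bin mean prediction with the per-bin mean observation. For a well-calibrated model, the two means stay close in every bin.

```python
import numpy as np

rng = np.random.default_rng(42)
y_pred = rng.uniform(30_000, 70_000, size=5_000)
y_true = y_pred + rng.normal(scale=2_000, size=5_000)  # calibrated by design

# Quantile-based bin edges, as in the qcut-based grouping above.
edges = np.quantile(y_pred, np.linspace(0, 1, 11))
bin_ids = np.clip(np.digitize(y_pred, edges[1:-1]), 0, 9)
for b in range(10):
    in_bin = bin_ids == b
    print(
        f"bin {b}: mean predicted {y_pred[in_bin].mean():8.0f}, "
        f"mean observed {y_true[in_bin].mean():8.0f}"
    )
```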
def plot_residuals_vs_predicted(cv_predictions):
    """Scatter plot of residuals versus predicted values for all CV folds."""
    all_scatter_plots = []

    x_title = "Predicted Load (MW)"
    y_title = "Residual load (MW): predicted - actual"

    for i, cv_prediction in enumerate(cv_predictions):
        # Get date range for this CV fold
        min_date = cv_prediction["prediction_time"].min().strftime("%Y-%m-%d")
        max_date = cv_prediction["prediction_time"].max().strftime("%Y-%m-%d")
        fold_label = f"#{i+1} - {min_date} to {max_date}"

        # Calculate residuals
        residuals_data = cv_prediction.with_columns(
            [(pl.col("predicted_load_mw") - pl.col("load_mw")).alias("residual")]
        ).with_columns([pl.lit(fold_label).alias("fold_label")])

        # Create scatter plot for this CV fold
        scatter_plot = (
            altair.Chart(residuals_data)
            .mark_circle(opacity=0.6, size=20)
            .encode(
                x=altair.X(
                    "predicted_load_mw:Q",
                    title=x_title,
                    scale=altair.Scale(zero=False),
                ),
                y=altair.Y("residual:Q", title=y_title),
                color=altair.Color("fold_label:N", legend=None),
                tooltip=[
                    "prediction_time:T",
                    "load_mw:Q",
                    "predicted_load_mw:Q",
                    "residual:Q",
                    "fold_label:N",
                ],
            )
        )

        all_scatter_plots.append(scatter_plot)

    # Get the range of predicted values for the perfect line
    all_predictions = pl.concat(
        [cv_pred["predicted_load_mw"] for cv_pred in cv_predictions]
    )
    min_pred, max_pred = all_predictions.min(), all_predictions.max()

    # Create perfect residuals line at y=0
    perfect_line = (
        altair.Chart(
            pl.DataFrame(
                {
                    "predicted_load_mw": [min_pred, max_pred],
                    "perfect_residual": [0, 0],
                    "label": ["Perfect"] * 2,
                }
            )
        )
        .mark_line(strokeDash=[5, 5], opacity=0.8, color="black")
        .encode(
            x=altair.X("predicted_load_mw:Q", title=x_title),
            y=altair.Y("perfect_residual:Q", title=y_title),
            color=altair.Color(
                "label:N",
                scale=altair.Scale(range=["black"]),
                legend=None,
            ),
        )
    )
    # Combine all scatter plots
    combined_scatter = all_scatter_plots[0]
    for plot in all_scatter_plots[1:]:
        combined_scatter += plot

    # Layer the scatter plots with the perfect line
    return (combined_scatter + perfect_line).resolve_scale(color="independent")


plot_residuals_vs_predicted(hgbr_cv_predictions).interactive().properties(
    title="Residuals vs Predicted Values from cross-validation predictions"
)
def plot_binned_residuals(cv_predictions, by="hour"):
    """Plot the average residuals binned by time period, one line per CV fold."""
    # Configure binning based on the 'by' parameter
    if by == "hour":
        time_column = "hour_of_day"
        time_extractor = pl.col("prediction_time").dt.hour().alias(time_column)
        x_title = "Hour of day"
    elif by == "month":
        time_column = "month_of_year"
        time_extractor = pl.col("prediction_time").dt.month().alias(time_column)
        x_title = "Month of year"
    else:
        raise ValueError(f"Unsupported binning method: {by}. Use 'hour' or 'month'.")

    all_iqr_bands = []
    all_mean_lines = []
    time_range = None  # Will store the min/max time values for the perfect line

    for i, cv_prediction in enumerate(cv_predictions):
        # Get date range for this CV fold
        min_date = cv_prediction["prediction_time"].min().strftime("%Y-%m-%d")
        max_date = cv_prediction["prediction_time"].max().strftime("%Y-%m-%d")
        fold_label = f"#{i+1} - {min_date} to {max_date}"

        # Create residuals and time binning columns
        residuals_detailed = cv_prediction.with_columns(
            [
                (pl.col("predicted_load_mw") - pl.col("load_mw")).alias("residual"),
                time_extractor,
            ]
        )

        # Calculate statistics for this CV fold
        residuals_stats = (
            residuals_detailed.group_by(time_column)
            .agg(
                [
                    pl.col("residual").mean().round(1).alias("mean_residual"),
                    pl.col("residual").quantile(0.25).round(1).alias("q25_residual"),
                    pl.col("residual").quantile(0.75).round(1).alias("q75_residual"),
                ]
            )
            .sort(time_column)
            .with_columns(pl.lit(fold_label).alias("fold_label"))
        )

        # Store time range for perfect line (use the first CV fold)
        if time_range is None:
            time_range = (
                residuals_stats[time_column].min(),
                residuals_stats[time_column].max(),
            )
        else:
            time_range = (
                min(time_range[0], residuals_stats[time_column].min()),
                max(time_range[1], residuals_stats[time_column].max()),
            )
        # Create IQR band for this CV fold
        iqr_band = (
            altair.Chart(residuals_stats)
            .mark_area(opacity=0.15)
            .encode(
                x=altair.X(f"{time_column}:O", title=x_title),
                y=altair.Y("q25_residual:Q"),
                y2=altair.Y2("q75_residual:Q"),
            )
        )

        # Create mean line for this CV fold
        mean_line = (
            altair.Chart(residuals_stats)
            .mark_line(tooltip=True, point=True, opacity=0.8)
            .encode(
                x=altair.X(f"{time_column}:O", title=x_title),
                y=altair.Y("mean_residual:Q", title="Mean residual (MW)"),
                color=altair.Color("fold_label:N", legend=None),
                detail="fold_label:N",
            )
        )

        all_iqr_bands.append(iqr_band)
        all_mean_lines.append(mean_line)

    # Create perfect residuals line at y=0
    perfect_line = (
        altair.Chart(
            pl.DataFrame(
                {
                    time_column: [time_range[0], time_range[1]],
                    "perfect_residual": [0, 0],
                    "label": ["Perfect"] * 2,
                }
            )
        )
        .mark_line(strokeDash=[5, 5], opacity=0.8, color="black")
        .encode(
            x=altair.X(f"{time_column}:O", title=x_title),
            y=altair.Y("perfect_residual:Q", title="Mean residual (MW)"),
            color=altair.Color(
                "label:N",
                scale=altair.Scale(range=["black"]),
                legend=None,
            ),
        )
    )

    # Combine all IQR bands
    combined_iqr = all_iqr_bands[0]
    for band in all_iqr_bands[1:]:
        combined_iqr += band

    # Combine all mean lines
    combined_lines = all_mean_lines[0]
    for line in all_mean_lines[1:]:
        combined_lines += line

    # Layer the IQR bands behind the mean lines, with perfect line on top
    return (combined_iqr + combined_lines + perfect_line).resolve_scale(
        color="independent"
    )


plot_binned_residuals(hgbr_cv_predictions, by="hour").interactive().properties(
    title="Residuals by hour of the day from cross-validation predictions"
)
plot_binned_residuals(hgbr_cv_predictions, by="month").interactive().properties(
    title="Residuals by month of the year from cross-validation predictions"
)
ts_cv_2 = TimeSeriesSplit(
    n_splits=2, test_size=test_size, max_train_size=max_train_size, gap=24
)
randomized_search_hgbr = hgbr_predictions.skb.get_randomized_search(
    cv=ts_cv_2,
    scoring="r2",
    n_iter=100,
    fitted=True,
    verbose=1,
    n_jobs=-1,
)
Fitting 2 folds for each of 100 candidates, totalling 200 fits
randomized_search_hgbr.results_.round(3)
dropped_cols learning_rate loss max_leaf_nodes mean_test_score
0 none 0.120 poisson 46 0.979
1 rolling_load 0.136 poisson 54 0.979
2 rolling_load 0.261 poisson 13 0.978
3 future_1h 0.177 squared_error 20 0.978
4 rolling_load 0.167 poisson 64 0.977
... ... ... ... ... ...
95 cloud_cover 0.011 gamma 112 0.806
96 future_24h 0.011 poisson 31 0.802
97 calendar 0.016 squared_error 4 0.797
98 holiday 0.011 squared_error 8 0.782
99 moisture 0.010 poisson 4 0.692

100 rows × 5 columns

randomized_search_hgbr.plot_results().update_layout(margin=dict(l=150))
# nested_cv_results = skrub.cross_validate(
#     environment=hgbr_predictions.skb.get_data(),
#     pipeline=randomized_search_hgbr,
#     cv=ts_cv_5,
#     scoring={
#         "r2": get_scorer("r2"),
#         "mape": make_scorer(mean_absolute_percentage_error),
#     },
#     n_jobs=-1,
#     return_pipeline=True,
# ).round(3)
# nested_cv_results
# for outer_fold_idx in range(len(nested_cv_results)):
#     print(
#         nested_cv_results.loc[outer_fold_idx, "pipeline"]
#         .results_.loc[0]
#         .round(3)
#         .to_dict()
#     )
# TODO: Exercise applying a linear model with some additional feature engineering
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

model = make_pipeline(
    SimpleImputer(add_indicator=True),
    SplineTransformer(sparse_output=True),
    VarianceThreshold(threshold=1e-6),
    SelectKBest(k=skrub.choose_int(100, 1_000, log=True, name="n_selected_splines")),
    Nystroem(
        n_components=skrub.choose_int(
            10, 200, log=True, name="n_components", default=150
        )
    ),
    Ridge(alpha=skrub.choose_float(1e-6, 1e3, log=True, name="alpha", default=1e-2)),
)

predictions_ridge = features_with_dropped_cols.skb.apply(model, y=target)
predictions_ridge
<Apply Pipeline>

altair.Chart(
    pl.concat(
        [
            targets.skb.eval(),
            predictions_ridge.rename(
                {target_column_name: predicted_target_column_name}
            ).skb.eval(),
        ],
        how="horizontal",
    ).tail(24 * 7)
).transform_fold(
    [target_column_name, predicted_target_column_name],
).mark_line(
    tooltip=True
).encode(
    x="prediction_time:T", y="value:Q", color="key:N"
).interactive()
cv_results_ridge = predictions_ridge.skb.cross_validate(
    cv=ts_cv_5,
    scoring={
        "r2": get_scorer("r2"),
        "mape": make_scorer(mean_absolute_percentage_error),
    },
    return_train_score=True,
    return_pipeline=True,
    verbose=1,
    n_jobs=-1,
)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py:782: UserWarning:

A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   25.0s finished
cv_predictions_ridge = collect_cv_predictions(
    cv_results_ridge["pipeline"], ts_cv_5, predictions_ridge, prediction_time
)
plot_lorenz_curve(cv_predictions_ridge, n_samples=500).interactive()
plot_reliability_diagram(cv_predictions_ridge).interactive().properties(
    title="Reliability diagram from cross-validation predictions"
)
randomized_search_ridge = predictions_ridge.skb.get_randomized_search(
    cv=ts_cv_2,
    scoring="r2",
    n_iter=100,
    fitted=True,
    verbose=1,
    n_jobs=-1,
)
Fitting 2 folds for each of 100 candidates, totalling 200 fits
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=829 is greater than n_features=138. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=781 is greater than n_features=584. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=781 is greater than n_features=524. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=882 is greater than n_features=138. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=606 is greater than n_features=234. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=606 is greater than n_features=222. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=945 is greater than n_features=584. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=945 is greater than n_features=524. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=614 is greater than n_features=584. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=614 is greater than n_features=524. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=390 is greater than n_features=138. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=390 is greater than n_features=138. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=900 is greater than n_features=584. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=900 is greater than n_features=524. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=693 is greater than n_features=584. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=693 is greater than n_features=524. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=988 is greater than n_features=966. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=389 is greater than n_features=138. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=389 is greater than n_features=138. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=807 is greater than n_features=138. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=807 is greater than n_features=138. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=934 is greater than n_features=838. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=999 is greater than n_features=584. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=999 is greater than n_features=524. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=707 is greater than n_features=234. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=707 is greater than n_features=222. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=303 is greater than n_features=234. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=303 is greater than n_features=222. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=638 is greater than n_features=234. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=638 is greater than n_features=222. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=906 is greater than n_features=584. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=906 is greater than n_features=524. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:782: UserWarning: k=767 is greater than n_features=138. All the features will be returned.
  warnings.warn(
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: divide by zero encountered in divide
  f = msb / msw
randomized_search_ridge.plot_results().update_layout(margin=dict(l=200))

We observe that the default values of the hyper-parameters lie in the optimal region explored by the randomized search. This is a good sign that the model is well specified and that its performance is not overly sensitive to small changes in those values.

We could further assess the stability of those optimal hyper-parameters by running a nested cross-validation, where we would perform a randomized search inside each fold of an outer cross-validation loop, as sketched below. We skip it here because it is computationally expensive.

# nested_cv_results_ridge = skrub.cross_validate(
#     environment=predictions_ridge.skb.get_data(),
#     pipeline=randomized_search_ridge,
#     cv=ts_cv_5,
#     scoring={
#         "r2": get_scorer("r2"),
#         "mape": make_scorer(mean_absolute_percentage_error),
#     },
#     n_jobs=-1,
#     return_pipeline=True,
# )
# nested_cv_results_ridge.round(3)
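To make the idea concrete without the cost of running it on the full pipeline, here is a minimal nested cross-validation sketch using plain scikit-learn on synthetic data (not the notebook's skrub expressions): the inner loop tunes the hyper-parameters, the outer loop estimates the generalization performance of the tuned model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=300)

inner_cv = TimeSeriesSplit(n_splits=3)
outer_cv = TimeSeriesSplit(n_splits=5)
search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": np.logspace(-3, 3, 13)},
    n_iter=5,
    cv=inner_cv,
    random_state=0,
)
# One full hyper-parameter search is re-run per outer fold, which is
# exactly why nested CV is expensive on a large pipeline.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print(outer_scores.round(3))
```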
from sklearn.multioutput import MultiOutputRegressor

multioutput_predictions = features_with_dropped_cols.skb.apply(
    MultiOutputRegressor(
        estimator=HistGradientBoostingRegressor(random_state=0), n_jobs=-1
    ),
    y=targets.skb.drop(cols=["prediction_time", "load_mw"]).skb.mark_as_y(),
).skb.set_name("multioutput_gbdt")
target_column_names = [target_column_name_pattern.format(horizon=h) for h in horizons]
predicted_target_column_names = [
    f"predicted_{target_column_name}" for target_column_name in target_column_names
]
named_predictions = multioutput_predictions.rename(
    dict(zip(target_column_names, predicted_target_column_names))
)
import datetime


def plot_horizon_forecast(
    targets, named_predictions, plot_at_time, historical_timedelta
):
    """Plot the true target and the forecast values."""
    merged_data = targets.skb.select(cols=["prediction_time", "load_mw"]).skb.concat(
        [named_predictions], axis=1
    )
    start_time = plot_at_time - historical_timedelta
    end_time = plot_at_time + datetime.timedelta(
        hours=named_predictions.skb.eval().shape[1]
    )
    true_values_past = merged_data.filter(
        pl.col("prediction_time").is_between(start_time, plot_at_time, closed="both")
    ).rename({"load_mw": "Past true load"})
    true_values_future = merged_data.filter(
        pl.col("prediction_time").is_between(plot_at_time, end_time, closed="both")
    ).rename({"load_mw": "Future true load"})
    predicted_record = (
        merged_data.skb.select(
            cols=skrub.selectors.filter_names(str.startswith, "predict")
        )
        .row(by_predicate=pl.col("prediction_time") == plot_at_time, named=True)
        .skb.eval()
    )
    forecast_values = pl.DataFrame(
        {
            "prediction_time": predicted_record["prediction_time"]
            + datetime.timedelta(hours=horizon),
            "Forecast load": predicted_record[
                "predicted_" + target_column_name_pattern.format(horizon=horizon)
            ],
        }
        for horizon in range(1, len(predicted_record))
    )

    true_values_past_chart = (
        altair.Chart(true_values_past.skb.eval())
        .transform_fold(["Past true load"])
        .mark_line(tooltip=True)
        .encode(x="prediction_time:T", y="Past true load:Q", color="key:N")
    )
    true_values_future_chart = (
        altair.Chart(true_values_future.skb.eval())
        .transform_fold(["Future true load"])
        .mark_line(tooltip=True)
        .encode(x="prediction_time:T", y="Future true load:Q", color="key:N")
    )
    forecast_values_chart = (
        altair.Chart(forecast_values)
        .transform_fold(["Forecast load"])
        .mark_line(tooltip=True)
        .encode(x="prediction_time:T", y="Forecast load:Q", color="key:N")
    )
    return (
        true_values_past_chart + true_values_future_chart + forecast_values_chart
    ).interactive()
plot_at_time = datetime.datetime(2025, 5, 24, 0, 0, tzinfo=datetime.timezone.utc)
historical_timedelta = datetime.timedelta(hours=24 * 5)
plot_horizon_forecast(targets, named_predictions, plot_at_time, historical_timedelta)
plot_at_time = datetime.datetime(2025, 5, 25, 0, 0, tzinfo=datetime.timezone.utc)
plot_horizon_forecast(targets, named_predictions, plot_at_time, historical_timedelta)
from sklearn.metrics import r2_score


def multioutput_scorer(regressor, X, y, score_func, score_name):
    y_pred = regressor.predict(X)
    return {
        f"{score_name}_horizon_{h}h": score
        for h, score in enumerate(
            score_func(y, y_pred, multioutput="raw_values"), start=1
        )
    }


def scoring(regressor, X, y):
    return {
        **multioutput_scorer(regressor, X, y, mean_absolute_percentage_error, "mape"),
        **multioutput_scorer(regressor, X, y, r2_score, "r2"),
    }
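The per-horizon scorers above rely on scikit-learn's `multioutput="raw_values"` option, which returns one score per target column instead of a single averaged value. A tiny standalone check of that behavior:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 32.0]])
y_pred = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# One R^2 per output column rather than a uniform average across columns.
per_column = r2_score(y_true, y_pred, multioutput="raw_values")
print(per_column)  # first column is predicted perfectly -> R^2 of 1.0
```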


multioutput_cv_results = multioutput_predictions.skb.cross_validate(
    cv=ts_cv_5,
    scoring=scoring,
    return_train_score=True,
    verbose=1,
    n_jobs=-1,
).round(3)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
/home/runner/work/forecasting/forecasting/.pixi/envs/doc/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py:782: UserWarning:

A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.9min finished
multioutput_cv_results
fit_time score_time test_mape_horizon_1h train_mape_horizon_1h test_mape_horizon_2h train_mape_horizon_2h test_mape_horizon_3h train_mape_horizon_3h test_mape_horizon_4h train_mape_horizon_4h ... test_r2_horizon_20h train_r2_horizon_20h test_r2_horizon_21h train_r2_horizon_21h test_r2_horizon_22h train_r2_horizon_22h test_r2_horizon_23h train_r2_horizon_23h test_r2_horizon_24h train_r2_horizon_24h
0 74.987 1.793 0.013 0.007 0.020 0.011 0.025 0.012 0.027 0.013 ... 0.949 0.993 0.958 0.994 0.959 0.994 0.962 0.995 0.963 0.995
1 82.568 1.665 0.016 0.008 0.025 0.012 0.029 0.014 0.030 0.014 ... 0.970 0.992 0.974 0.993 0.975 0.994 0.977 0.994 0.977 0.994
2 82.205 1.593 0.016 0.009 0.023 0.013 0.025 0.015 0.027 0.015 ... 0.960 0.991 0.970 0.993 0.971 0.993 0.973 0.993 0.973 0.993
3 81.268 1.901 0.012 0.010 0.018 0.014 0.021 0.015 0.023 0.016 ... 0.973 0.989 0.977 0.992 0.980 0.993 0.979 0.993 0.979 0.993
4 20.194 0.468 0.013 0.010 0.019 0.014 0.023 0.016 0.026 0.016 ... 0.969 0.989 0.974 0.992 0.976 0.993 0.978 0.993 0.978 0.993

5 rows × 98 columns

import itertools
from IPython.display import display

for metric_name, dataset_type in itertools.product(["mape", "r2"], ["train", "test"]):
    columns = multioutput_cv_results.columns[
        multioutput_cv_results.columns.str.startswith(f"{dataset_type}_{metric_name}")
    ]
    data_to_plot = multioutput_cv_results[columns]
    data_to_plot.columns = [
        col.replace(f"{dataset_type}_", "")
        .replace(f"{metric_name}_", "")
        .replace("_", " ")
        for col in columns
    ]

    data_long = data_to_plot.melt(var_name="horizon", value_name="score")
    chart = (
        altair.Chart(
            data_long,
            title=f"{dataset_type.title()} {metric_name.upper()} Scores by Horizon",
        )
        .mark_boxplot(extent="min-max")
        .encode(
            x=altair.X(
                "horizon:N",
                title="Horizon",
                sort=altair.Sort(
                    [f"horizon {h}h" for h in range(1, data_to_plot.shape[1])]
                ),
            ),
            y=altair.Y("score:Q", title=f"{metric_name.upper()} Score"),
            color=altair.Color("horizon:N", legend=None),
        )
    )

    display(chart)
# Exercise: repeat the multi-output forecasting with RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

multioutput_predictions_rf = features_with_dropped_cols.skb.apply(
    RandomForestRegressor(min_samples_leaf=30, random_state=0, n_jobs=-1),
    y=targets.skb.drop(cols=["prediction_time", "load_mw"]).skb.mark_as_y(),
).skb.set_name("random_forest")
named_predictions_rf = multioutput_predictions_rf.rename(
    dict(zip(target_column_names, predicted_target_column_names))
)
plot_at_time = datetime.datetime(2025, 5, 24, 0, 0, tzinfo=datetime.timezone.utc)
historical_timedelta = datetime.timedelta(hours=24 * 5)
plot_horizon_forecast(targets, named_predictions_rf, plot_at_time, historical_timedelta)
plot_at_time = datetime.datetime(2025, 5, 25, 0, 0, tzinfo=datetime.timezone.utc)
plot_horizon_forecast(targets, named_predictions_rf, plot_at_time, historical_timedelta)
multioutput_cv_results_rf = multioutput_predictions_rf.skb.cross_validate(
    cv=ts_cv_5,
    scoring=scoring,
    return_train_score=True,
    verbose=1,
    n_jobs=-1,
)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.3min finished
multioutput_cv_results_rf.round(3)
fit_time score_time test_mape_horizon_1h train_mape_horizon_1h test_mape_horizon_2h train_mape_horizon_2h test_mape_horizon_3h train_mape_horizon_3h test_mape_horizon_4h train_mape_horizon_4h ... test_r2_horizon_20h train_r2_horizon_20h test_r2_horizon_21h train_r2_horizon_21h test_r2_horizon_22h train_r2_horizon_22h test_r2_horizon_23h train_r2_horizon_23h test_r2_horizon_24h train_r2_horizon_24h
0 90.150 0.373 0.031 0.023 0.035 0.026 0.036 0.027 0.038 0.029 ... 0.883 0.947 0.885 0.949 0.884 0.948 0.881 0.947 0.876 0.945
1 106.325 0.192 0.033 0.023 0.038 0.026 0.040 0.028 0.042 0.029 ... 0.914 0.950 0.918 0.950 0.918 0.949 0.919 0.948 0.918 0.946
2 112.280 0.201 0.028 0.024 0.033 0.027 0.035 0.029 0.037 0.030 ... 0.901 0.948 0.904 0.949 0.904 0.947 0.904 0.946 0.901 0.944
3 108.346 0.210 0.023 0.024 0.027 0.028 0.029 0.030 0.031 0.031 ... 0.901 0.943 0.904 0.945 0.904 0.944 0.904 0.942 0.902 0.940
4 43.784 0.084 0.031 0.024 0.035 0.027 0.038 0.029 0.041 0.031 ... 0.891 0.944 0.894 0.945 0.896 0.944 0.896 0.942 0.893 0.940

5 rows × 98 columns

import itertools
from IPython.display import display

for metric_name, dataset_type in itertools.product(["mape", "r2"], ["train", "test"]):
    columns = multioutput_cv_results_rf.columns[
        multioutput_cv_results_rf.columns.str.startswith(f"{dataset_type}_{metric_name}")
    ]
    data_to_plot = multioutput_cv_results_rf[columns]
    data_to_plot.columns = [
        col.replace(f"{dataset_type}_", "")
        .replace(f"{metric_name}_", "")
        .replace("_", " ")
        for col in columns
    ]

    data_long = data_to_plot.melt(var_name="horizon", value_name="score")
    chart = (
        altair.Chart(
            data_long,
            title=f"{dataset_type.title()} {metric_name.upper()} Scores by Horizon",
        )
        .mark_boxplot(extent="min-max")
        .encode(
            x=altair.X(
                "horizon:N",
                title="Horizon",
                sort=altair.Sort(
                    [f"horizon {h}h" for h in range(1, data_to_plot.shape[1])]
                ),
            ),
            y=altair.Y("score:Q", title=f"{metric_name.upper()} Score"),
            color=altair.Color("horizon:N", legend=None),
        )
    )

    display(chart)
from sklearn.metrics import d2_pinball_score

scoring = {
    "r2": get_scorer("r2"),
    "mape": make_scorer(mean_absolute_percentage_error),
    "d2_pinball_05_loss": make_scorer(d2_pinball_score, alpha=0.05),
    "d2_pinball_50_loss": make_scorer(d2_pinball_score, alpha=0.5),
    "d2_pinball_95_loss": make_scorer(d2_pinball_score, alpha=0.95),
}
common_params = dict(
    loss="quantile", learning_rate=0.1, max_leaf_nodes=100, random_state=0
)
predictions_gbrt_05 = features_with_dropped_cols.skb.apply(
    HistGradientBoostingRegressor(**common_params, quantile=0.05),
    y=target,
)
predictions_gbrt_50 = features_with_dropped_cols.skb.apply(
    HistGradientBoostingRegressor(**common_params, quantile=0.5),
    y=target,
)
predictions_gbrt_95 = features_with_dropped_cols.skb.apply(
    HistGradientBoostingRegressor(**common_params, quantile=0.95),
    y=target,
)
cv_results_05 = predictions_gbrt_05.skb.cross_validate(
    cv=ts_cv_5,
    scoring=scoring,
    return_pipeline=True,
    verbose=1,
    n_jobs=-1,
)
cv_results_50 = predictions_gbrt_50.skb.cross_validate(
    cv=ts_cv_5,
    scoring=scoring,
    return_pipeline=True,
    verbose=1,
    n_jobs=-1,
)
cv_results_95 = predictions_gbrt_95.skb.cross_validate(
    cv=ts_cv_5,
    scoring=scoring,
    return_pipeline=True,
    verbose=1,
    n_jobs=-1,
)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   10.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   12.8s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   10.2s finished
cv_results_05[[col for col in cv_results_05.columns if col.startswith("test_")]].mean(
    axis=0
).round(3)
test_r2                    0.853
test_mape                  0.051
test_d2_pinball_05_loss    0.698
test_d2_pinball_50_loss    0.644
test_d2_pinball_95_loss   -1.250
dtype: float64
cv_results_50[[col for col in cv_results_50.columns if col.startswith("test_")]].mean(
    axis=0
).round(3)
test_r2                    0.970
test_mape                  0.024
test_d2_pinball_05_loss    0.180
test_d2_pinball_50_loss    0.841
test_d2_pinball_95_loss    0.494
dtype: float64
cv_results_95[[col for col in cv_results_95.columns if col.startswith("test_")]].mean(
    axis=0
).round(3)
test_r2                    0.848
test_mape                  0.064
test_d2_pinball_05_loss   -2.392
test_d2_pinball_50_loss    0.611
test_d2_pinball_95_loss    0.775
dtype: float64
results = pl.concat(
    [
        targets.skb.select(cols=["prediction_time", target_column_name]).skb.eval(),
        predictions_gbrt_05.rename({target_column_name: "quantile_05"}).skb.eval(),
        predictions_gbrt_50.rename({target_column_name: "median"}).skb.eval(),
        predictions_gbrt_95.rename({target_column_name: "quantile_95"}).skb.eval(),
    ],
    how="horizontal",
).tail(24 * 7)
median_chart = (
    altair.Chart(results)
    .transform_fold([target_column_name, "median"])
    .mark_line(tooltip=True)
    .encode(x="prediction_time:T", y="value:Q", color="key:N")
)

quantile_band_chart = (
    altair.Chart(results)
    .mark_area(opacity=0.4, tooltip=True)
    .encode(
        x="prediction_time:T",
        y="quantile_05:Q",
        y2="quantile_95:Q",
        color=altair.value("lightgreen"),
    )
)

combined_chart = quantile_band_chart + median_chart
combined_chart.interactive()
cv_predictions_05 = collect_cv_predictions(
    cv_results_05["pipeline"], ts_cv_5, predictions_gbrt_05, prediction_time
)
cv_predictions_50 = collect_cv_predictions(
    cv_results_50["pipeline"], ts_cv_5, predictions_gbrt_50, prediction_time
)
cv_predictions_95 = collect_cv_predictions(
    cv_results_95["pipeline"], ts_cv_5, predictions_gbrt_95, prediction_time
)
plot_residuals_vs_predicted(cv_predictions_05).interactive().properties(
    title=(
        "Residuals vs Predicted Values from cross-validation predictions"
        " for quantile 0.05"
    )
)
plot_residuals_vs_predicted(cv_predictions_50).interactive().properties(
    title=(
        "Residuals vs Predicted Values from cross-validation predictions for median"
    )
)
plot_residuals_vs_predicted(cv_predictions_95).interactive().properties(
    title=(
        "Residuals vs Predicted Values from cross-validation predictions"
        " for quantile 0.95"
    )
)
cv_predictions_05_concat = pl.concat(cv_predictions_05, how="vertical")
cv_predictions_50_concat = pl.concat(cv_predictions_50, how="vertical")
cv_predictions_95_concat = pl.concat(cv_predictions_95, how="vertical")
import matplotlib.pyplot as plt
from sklearn.metrics import PredictionErrorDisplay


for kind in ["actual_vs_predicted", "residual_vs_predicted"]:
    fig, axs = plt.subplots(1, 3, figsize=(15, 5), sharey=True)

    PredictionErrorDisplay.from_predictions(
        y_true=cv_predictions_05_concat["load_mw"].to_numpy(),
        y_pred=cv_predictions_05_concat["predicted_load_mw"].to_numpy(),
        kind=kind,
        ax=axs[0],
    )
    axs[0].set_title("0.05 quantile regression")

    PredictionErrorDisplay.from_predictions(
        y_true=cv_predictions_50_concat["load_mw"].to_numpy(),
        y_pred=cv_predictions_50_concat["predicted_load_mw"].to_numpy(),
        kind=kind,
        ax=axs[1],
    )
    axs[1].set_title("Median regression")

    PredictionErrorDisplay.from_predictions(
        y_true=cv_predictions_95_concat["load_mw"].to_numpy(),
        y_pred=cv_predictions_95_concat["predicted_load_mw"].to_numpy(),
        kind=kind,
        ax=axs[2],
    )
    axs[2].set_title("0.95 quantile regression")

    fig.suptitle(f"{kind} for GBRT minimizing different quantile losses")
[Figure: actual vs. predicted and residual vs. predicted plots for the 0.05-quantile, median, and 0.95-quantile GBRT models]
def coverage(y_true, y_quantile_low, y_quantile_high):
    y_true = np.asarray(y_true)
    y_quantile_low = np.asarray(y_quantile_low)
    y_quantile_high = np.asarray(y_quantile_high)
    return float(
        np.logical_and(y_true >= y_quantile_low, y_true <= y_quantile_high)
        .mean()
        .round(4)
    )


def mean_width(y_true, y_quantile_low, y_quantile_high):
    y_true = np.asarray(y_true)
    y_quantile_low = np.asarray(y_quantile_low)
    y_quantile_high = np.asarray(y_quantile_high)
    return float(np.abs(y_quantile_high - y_quantile_low).mean().round(1))


def binned_coverage(y_true_folds, y_quantile_low, y_quantile_high, n_bins=10):
    """Compute coverage after binning true values using quantile-based binning.

    Parameters
    ----------
    y_true_folds : list of numpy.ndarray
        List of true target values, one array per CV fold
    y_quantile_low : list of numpy.ndarray
        List of lower quantile predictions, one array per CV fold
    y_quantile_high : list of numpy.ndarray
        List of upper quantile predictions, one array per CV fold
    n_bins : int, default=10
        Number of bins to create

    Returns
    -------
    pandas.DataFrame
        DataFrame with columns: bin_left, bin_right, bin_center, fold_idx,
        coverage, mean_width, n_samples
    """
    # Use all true values to define global bin boundaries
    all_true_values = np.concatenate(y_true_folds)
    df = pd.DataFrame({"bin_by": all_true_values})
    df["bin"] = pd.qcut(df["bin_by"], q=n_bins, labels=False, duplicates="drop")

    # Get bin boundaries for consistent binning across folds
    bin_boundaries = []
    for bin_idx in sorted(df["bin"].dropna().unique()):
        bin_mask = df["bin"] == bin_idx
        bin_values = df.loc[bin_mask, "bin_by"]
        bin_boundaries.append((bin_values.min(), bin_values.max()))

    results = []
    n_folds = len(y_quantile_low)

    for fold_idx in range(n_folds):
        fold_true = y_true_folds[fold_idx]
        fold_low = y_quantile_low[fold_idx]
        fold_high = y_quantile_high[fold_idx]

        # Assign each sample in this fold to a bin
        fold_bins = (
            np.digitize(fold_true, bins=[b[0] for b in bin_boundaries] + [np.inf]) - 1
        )

        for bin_idx, (bin_left, bin_right) in enumerate(bin_boundaries):
            # Get samples from this fold that fall into this bin
            bin_mask = fold_bins == bin_idx

            if np.sum(bin_mask) == 0:
                # No samples in this bin for this fold
                continue

            fold_bin_true = fold_true[bin_mask]
            fold_bin_low = fold_low[bin_mask]
            fold_bin_high = fold_high[bin_mask]

            bin_center = (bin_left + bin_right) / 2
            n_samples_in_bin = len(fold_bin_true)

            coverage_score = coverage(fold_bin_true, fold_bin_low, fold_bin_high)
            width = mean_width(fold_bin_true, fold_bin_low, fold_bin_high)

            results.append(
                {
                    "bin_left": bin_left,
                    "bin_right": bin_right,
                    "bin_center": bin_center,
                    "fold_idx": fold_idx,
                    "coverage": coverage_score,
                    "mean_width": width,
                    "n_samples": n_samples_in_bin,
                }
            )

    return pd.DataFrame(results)
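A tiny standalone illustration of the quantile binning used above: `pd.qcut` splits the pooled true values into bins holding roughly equal numbers of samples, so coverage can be compared fairly across the whole load distribution rather than over fixed-width bins.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for pooled load values in MW.
values = rng.normal(loc=45_000, scale=8_000, size=1_000)

# labels=False returns the integer bin index for each sample.
bins = pd.qcut(values, q=5, labels=False)
print(np.bincount(bins))  # five bins of ~200 samples each
```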
print(
    "Global coverage:",
    coverage(
        cv_predictions_50_concat["load_mw"].to_numpy(),
        cv_predictions_05_concat["predicted_load_mw"].to_numpy(),
        cv_predictions_95_concat["predicted_load_mw"].to_numpy(),
    ),
)
print(
    "Global mean interval width:",
    mean_width(
        cv_predictions_50_concat["load_mw"].to_numpy(),
        cv_predictions_05_concat["predicted_load_mw"].to_numpy(),
        cv_predictions_95_concat["predicted_load_mw"].to_numpy(),
    ),
)

# Compute binned coverage scores
binned_coverage_results = binned_coverage(
    [df["load_mw"].to_numpy() for df in cv_predictions_50],
    [df["predicted_load_mw"].to_numpy() for df in cv_predictions_05],
    [df["predicted_load_mw"].to_numpy() for df in cv_predictions_95],
    n_bins=10,
)

binned_coverage_results
bin_left bin_right bin_center fold_idx coverage mean_width n_samples
0 28744.0 36884.0 32814.0 0 0.4205 4235.5 459
1 36886.0 40158.0 38522.0 0 0.6493 4291.2 499
2 40160.0 42982.0 41571.0 0 0.7117 4501.7 496
3 42983.0 45219.0 44101.0 0 0.7400 4281.1 473
4 45220.0 47332.0 46276.0 0 0.7401 4071.5 454
5 47334.0 49718.0 48526.0 0 0.7044 3868.6 477
6 49720.0 53126.0 51423.0 0 0.7418 4679.3 426
7 53129.0 57570.0 55349.5 0 0.8162 6144.5 321
8 57572.0 63173.0 60372.5 0 0.8687 7106.8 259
9 63176.0 86573.0 74874.5 0 0.8929 8807.9 168
10 28744.0 36884.0 32814.0 1 0.4637 3917.9 496
11 36886.0 40158.0 38522.0 1 0.7960 4557.4 402
12 40160.0 42982.0 41571.0 1 0.8550 4842.4 407
13 42983.0 45219.0 44101.0 1 0.8783 4546.3 378
14 45220.0 47332.0 46276.0 1 0.8939 4620.4 377
15 47334.0 49718.0 48526.0 1 0.8648 5101.1 392
16 49720.0 53126.0 51423.0 1 0.8450 5774.8 400
17 53129.0 57570.0 55349.5 1 0.8657 6392.7 402
18 57572.0 63173.0 60372.5 1 0.8005 6515.6 386
19 63176.0 86573.0 74874.5 1 0.6505 7550.9 392
20 28744.0 36884.0 32814.0 2 0.6887 4985.7 318
21 36886.0 40158.0 38522.0 2 0.7657 5111.9 367
22 40160.0 42982.0 41571.0 2 0.7731 5186.4 357
23 42983.0 45219.0 44101.0 2 0.8050 4952.7 400
24 45220.0 47332.0 46276.0 2 0.8814 4906.7 413
25 47334.0 49718.0 48526.0 2 0.8951 4984.1 410
26 49720.0 53126.0 51423.0 2 0.8716 5976.0 475
27 53129.0 57570.0 55349.5 2 0.8577 6568.6 534
28 57572.0 63173.0 60372.5 2 0.8386 6386.9 446
29 63176.0 86573.0 74874.5 2 0.6731 7098.5 312
30 28744.0 36884.0 32814.0 3 0.7563 4199.4 513
31 36886.0 40158.0 38522.0 3 0.8534 3682.2 498
32 40160.0 42982.0 41571.0 3 0.8358 3972.7 481
33 42983.0 45219.0 44101.0 3 0.8337 3750.7 457
34 45220.0 47332.0 46276.0 3 0.8767 3718.1 511
35 47334.0 49718.0 48526.0 3 0.7968 4168.4 497
36 49720.0 53126.0 51423.0 3 0.7019 4535.3 369
37 53129.0 57570.0 55349.5 3 0.8386 5236.3 254
38 57572.0 63173.0 60372.5 3 0.7041 5945.4 267
39 63176.0 86573.0 74874.5 3 0.3568 7160.5 185
40 28744.0 36884.0 32814.0 4 0.6217 4230.8 230
41 36886.0 40158.0 38522.0 4 0.7570 4435.0 251
42 40160.0 42982.0 41571.0 4 0.8723 4569.2 274
43 42983.0 45219.0 44101.0 4 0.8350 4213.3 309
44 45220.0 47332.0 46276.0 4 0.8385 4167.3 260
45 47334.0 49718.0 48526.0 4 0.7137 4547.9 241
46 49720.0 53126.0 51423.0 4 0.8324 5691.7 346
47 53129.0 57570.0 55349.5 4 0.8591 6481.9 504
48 57572.0 63173.0 60372.5 4 0.7857 6359.0 658
49 63176.0 86573.0 74874.5 4 0.5495 6589.4 959
coverage_by_bin = binned_coverage_results.copy()
coverage_by_bin["bin_label"] = coverage_by_bin.apply(
    lambda row: f"[{row.bin_left:.0f}, {row.bin_right:.0f}]", axis=1
)

Reliability diagram for quantile regression#

plot_reliability_diagram(
    cv_predictions_50, kind="quantile", quantile_level=0.50
).interactive().properties(
    title="Reliability diagram for quantile 0.50 from cross-validation predictions"
)
plot_reliability_diagram(
    cv_predictions_05, kind="quantile", quantile_level=0.05
).interactive().properties(
    title="Reliability diagram for quantile 0.05 from cross-validation predictions"
)
plot_reliability_diagram(
    cv_predictions_95, kind="quantile", quantile_level=0.95
).interactive().properties(
    title="Reliability diagram for quantile 0.95 from cross-validation predictions"
)
ax = coverage_by_bin.boxplot(
    column="coverage", by="bin_label", figsize=(12, 6), vert=False, whis=1000
)
ax.axvline(x=0.9, color="red", linestyle="--", label="Target coverage (0.9)")
ax.set_xlabel("Coverage")  # horizontal boxplot: coverage on the x-axis
ax.set_ylabel("Load bins (MW)")
ax.set_title("Coverage Distribution by Load Bins")
ax.legend()
plt.suptitle("")  # Remove automatic suptitle from boxplot
plt.xticks(rotation=45)
plt.tight_layout()
[figure: Coverage Distribution by Load Bins, per-fold coverage boxplots with the 0.9 target marked]

Quantile regression as classification#

In the following, we turn quantile regression at arbitrary quantile levels into a multiclass classification problem: the target variable is discretized into bins, a classifier is trained to predict bin membership, and the cumulative sum of the predicted bin probabilities is interpolated to estimate the CDF of the continuous target conditioned on the features. Any quantile level can then be obtained by inverting this estimated CDF.

Ideally, the classifier should remain efficient when trained on a large number of classes (one per bin). Therefore we use a Random Forest classifier as the default base estimator.
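The inverse-CDF step at the heart of this approach can be illustrated on a toy example (the bin edges and probabilities below are made up for illustration): the cumulative sum of the bin probabilities, evaluated at the bin edges, approximates the conditional CDF, and linearly interpolating the edges against these cumulative probabilities recovers any quantile.

```python
import numpy as np

# Made-up example: 4 bins with their edges and predicted membership probabilities.
edges = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
proba = np.array([0.1, 0.4, 0.3, 0.2])

# CDF at the bin edges: 0 at the leftmost edge, cumulative sums afterwards.
cdf = np.concatenate([[0.0], np.cumsum(proba)])  # [0.0, 0.1, 0.5, 0.8, 1.0]

# Inverse CDF by linear interpolation of edges against cumulative probabilities.
print(np.interp(0.5, cdf, edges))   # 20.0: the CDF reaches 0.5 at the 20.0 edge
print(np.interp(0.95, cdf, edges))  # ≈ 37.5: linear within the last bin
```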

from scipy.interpolate import interp1d
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.utils import check_random_state
from sklearn.utils.validation import check_consistent_length, check_is_fitted
import numpy as np


class BinnedQuantileRegressor(BaseEstimator, RegressorMixin):
    def __init__(
        self,
        estimator=None,
        n_bins=100,
        quantile=0.5,
        random_state=None,
    ):
        self.n_bins = n_bins
        self.estimator = estimator
        self.quantile = quantile
        self.random_state = random_state

    def fit(self, X, y):
        # Lightweight input validation: most of the input validation will be
        # handled by the sub estimators.
        random_state = check_random_state(self.random_state)
        check_consistent_length(X, y)
        self.target_binner_ = KBinsDiscretizer(
            n_bins=self.n_bins,
            strategy="quantile",
            subsample=200_000,
            encode="ordinal",
            random_state=random_state,
        )

        y_binned = (
            self.target_binner_.fit_transform(np.asarray(y).reshape(-1, 1))
            .ravel()
            .astype(np.int32)
        )

        # Fit the multiclass classifier to predict the binned targets from the
        # training set.
        if self.estimator is None:
            estimator = RandomForestClassifier(random_state=random_state)
        else:
            estimator = clone(self.estimator)
        self.estimator_ = estimator.fit(X, y_binned)
        return self

    def predict_quantiles(self, X, quantiles=(0.05, 0.5, 0.95)):
        check_is_fitted(self, "estimator_")
        edges = self.target_binner_.bin_edges_[0]
        n_bins = edges.shape[0] - 1
        expected_shape = (X.shape[0], n_bins)
        y_proba_raw = self.estimator_.predict_proba(X)

        # Some bins may stay empty on the training set. Classifiers typically
        # do not predict an explicit 0 probability for classes unobserved
        # during fit, so we have to pad their output accordingly:
        if y_proba_raw.shape != expected_shape:
            y_proba = np.zeros(shape=expected_shape)
            y_proba[:, self.estimator_.classes_] = y_proba_raw
        else:
            y_proba = y_proba_raw

        # Build the inverse CDF mapping from cumulative probabilities at the
        # bin edges back to continuous target values.
        y_cdf = np.zeros(shape=(X.shape[0], edges.shape[0]))
        y_cdf[:, 1:] = np.cumsum(y_proba, axis=1)
        return np.asarray([interp1d(y_cdf_i, edges)(quantiles) for y_cdf_i in y_cdf])

    def predict(self, X):
        return self.predict_quantiles(X, self.quantile).ravel()